FoleyGen: Visually-Guided Audio Generation
Recent advancements in audio generation have been spurred by the evolution of
large-scale deep learning models and expansive datasets. However, the task of
video-to-audio (V2A) generation continues to be a challenge, principally
because of the intricate relationship between the high-dimensional visual and
auditory data, and the challenges associated with temporal synchronization. In
this study, we introduce FoleyGen, an open-domain V2A generation system built
on a language modeling paradigm. FoleyGen leverages an off-the-shelf neural
audio codec for bidirectional conversion between waveforms and discrete tokens.
The generation of audio tokens is facilitated by a single Transformer model,
which is conditioned on visual features extracted from a visual encoder. A
prevalent problem in V2A generation is the misalignment of generated audio with
the visible actions in the video. To address this, we explore three novel
visual attention mechanisms. We further undertake an exhaustive evaluation of
multiple visual encoders, each pretrained on either single-modal or multi-modal
tasks. The experimental results on the VGGSound dataset show that our proposed
FoleyGen outperforms previous systems on all objective metrics and in human
evaluations.
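The language-modeling loop the abstract describes, generating discrete codec tokens autoregressively under visual conditioning, can be sketched as follows. Note that `next_token_logits` is a hypothetical stand-in for a Transformer forward pass, not FoleyGen's actual model:

```python
import numpy as np

def generate_audio_tokens(visual_feats, next_token_logits, n_steps):
    # Autoregressive V2A sketch: at each step, score the audio-token
    # vocabulary given the visual features and the tokens generated so
    # far, then append the highest-scoring token (greedy decoding).
    tokens = []
    for _ in range(n_steps):
        logits = next_token_logits(visual_feats, tokens)
        tokens.append(int(np.argmax(logits)))
    return tokens
```

In the full system, the resulting token sequence would be passed back through the neural audio codec's decoder to recover a waveform.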
Stack-and-Delay: a new codebook pattern for music generation
In language-modeling-based music generation, a generated waveform is
represented by a sequence of hierarchical token stacks that can be decoded
either in an auto-regressive manner or in parallel, depending on the codebook
patterns. In particular, flattening the codebooks represents the highest
quality decoding strategy, while being notoriously slow. To address this, we
propose a novel stack-and-delay decoding strategy that improves upon
flat-pattern decoding, generating four times faster than vanilla flat decoding.
This brings the inference time close to that of the
delay decoding strategy, and allows for faster inference on GPU for small batch
sizes. For the same inference efficiency budget as the delay pattern, we show
that the proposed approach performs better in objective evaluations, almost
closing the gap with the flat pattern in terms of quality. The results are
corroborated by subjective evaluations, which show that samples generated by the
new model are slightly more often preferred to those generated by the
competing model given the same text prompts.
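For context, the two baseline codebook patterns the abstract contrasts can be sketched for a token grid of K codebooks by T frames: the flat pattern serializes everything into one long sequence (K·T autoregressive steps), while the delay pattern shifts codebook k right by k steps so all K codebooks are predicted in parallel (T+K-1 steps). This is an illustrative sketch of those known baselines; the stack-and-delay pattern itself is not reproduced here:

```python
import numpy as np

def flat_pattern(tokens):
    # tokens: (K, T) grid -> one long sequence, K*T autoregressive
    # steps: frame 0's codebooks, then frame 1's, and so on.
    return tokens.T.reshape(-1)

def delay_pattern(tokens, pad=-1):
    # Codebook k is shifted right by k steps; all K codebooks are
    # predicted in parallel at each step -> T + K - 1 steps total.
    K, T = tokens.shape
    out = np.full((K, T + K - 1), pad)
    for k in range(K):
        out[k, k:k + T] = tokens[k]
    return out
```

The quality/speed trade-off follows directly from these layouts: flat decoding takes K times as many sequential steps as the parallel patterns.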
Exploring Speech Enhancement for Low-resource Speech Synthesis
High-quality and intelligible speech is essential to text-to-speech (TTS)
model training; however, obtaining high-quality data for low-resource languages
is challenging and expensive. Applying speech enhancement to an Automatic Speech
Recognition (ASR) corpus mitigates the issue by augmenting the training data,
but how the nonlinear speech distortion introduced by speech enhancement models
affects TTS training still needs to be investigated. In this paper, we train a
TF-GridNet speech enhancement model and apply it to low-resource datasets that
were collected for the ASR task, and then train a discrete-unit-based TTS model on
the enhanced speech. We use Arabic datasets as an example and show that the
proposed pipeline significantly improves the low-resource TTS system compared
with other baseline methods in terms of the ASR WER metric. We also run an
empirical analysis of the correlation between speech enhancement and TTS performance.
Comment: Submitted to ICASSP 202
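Since the pipeline above is judged by the ASR WER metric, a minimal word-error-rate implementation (Levenshtein edit distance over word tokens) can serve as a reference for how that number is computed:

```python
def wer(ref, hyp):
    # Word error rate: minimum number of substitutions, insertions,
    # and deletions turning hyp into ref, divided by reference length.
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)
```

Lower is better: synthesizing speech from the enhanced data, transcribing it with an ASR model, and scoring the transcript against the input text gives the metric used above.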
Enhance audio generation controllability through representation similarity regularization
This paper presents an innovative approach to enhance control over audio
generation by emphasizing the alignment between audio and text representations
during model training. In the context of language model-based audio generation,
the model leverages input from both textual and audio token representations to
predict subsequent audio tokens. However, the current configuration lacks
explicit regularization to ensure the alignment between the chosen text
representation and the language model's predictions. Our proposal involves the
incorporation of audio and text representation regularization, particularly
during the classifier-free guidance (CFG) phase, where the text condition is
excluded from cross attention during language model training. The aim of this
proposed representation regularization is to minimize discrepancies in audio
and text similarity compared to other samples within the same training batch.
Experimental results demonstrate that our proposed methods lead to improvements
in objective metrics for both audio and music generation, as well as improved
human perception for
audio generation.
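A minimal sketch of the in-batch similarity idea described above: build cosine-similarity matrices over the audio and text embeddings of a training batch, and penalize the discrepancy between them. This is an illustrative MSE formulation assuming pooled per-sample embeddings, not the paper's exact loss:

```python
import numpy as np

def similarity_regularization(audio_emb, text_emb):
    # audio_emb, text_emb: (B, D) pooled per-sample embeddings.
    # L2-normalize, form in-batch cosine-similarity matrices, and
    # return the mean squared discrepancy between the two matrices.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sim_a = a @ a.T  # audio-audio similarities within the batch
    sim_t = t @ t.T  # text-text similarities within the batch
    return float(np.mean((sim_a - sim_t) ** 2))
```

Added to the language-modeling objective with a small weight, a term like this encourages samples that are similar in text space to remain similar in audio space.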
Scaling Speech Technology to 1,000+ Languages
Expanding the language coverage of speech technology has the potential to
improve access to information for many more people. However, current speech
technology is restricted to about one hundred languages, a small
fraction of the over 7,000 languages spoken around the world. The Massively
Multilingual Speech (MMS) project increases the number of supported languages
by 10-40x, depending on the task. The main ingredients are a new dataset based
on readings of publicly available religious texts and the effective use of
self-supervised learning. We built pre-trained wav2vec 2.0 models covering
1,406 languages, a single multilingual automatic speech recognition model for
1,107 languages, speech synthesis models for the same number of languages, as
well as a language identification model for 4,017 languages. Experiments show
that our multilingual speech recognition model more than halves the word error
rate of Whisper on 54 languages of the FLEURS benchmark while being trained on
a small fraction of the labeled data.